Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorize ROW initialization #15501

Merged
merged 2 commits into from
Jun 15, 2023
Merged

Conversation

lhecker
Copy link
Member

@lhecker lhecker commented Jun 2, 2023

Performance of printing enwik8.txt at the following block sizes:
4KiB (printf): 51MB/s -> 54MB/s
128KiB (cat): 92MB/s -> 103MB/s

Validation Steps Performed

  • Rows are properly filled with whitespace at various
    window sizes as observed under a debugger ✅

@lhecker lhecker added Product-Conhost For issues in the Console codebase Area-Performance Performance-related issue labels Jun 2, 2023
@github-actions

This comment has been minimized.

@lhecker lhecker force-pushed the dev/lhecker/vt-perf5 branch 2 times, most recently from b1321d2 to a919562 Compare June 5, 2023 16:38
@lhecker lhecker changed the title Initialize rows lazily Vectorize ROW initialization Jun 5, 2023
@github-actions

This comment has been minimized.

src/buffer/out/Row.cpp Outdated Show resolved Hide resolved
src/buffer/out/Row.cpp Outdated Show resolved Hide resolved
// implement Reset() efficiently via SIMD and the latter is used to store the past-the-end offset
// into the `charsBuffer`. Even though the `charsBuffer` could be only `rowWidth` large we need them
// to be the same size so that the SIMD code can process both arrays in the same loop simultaneously.
// This wastes up to 5.8% memory but increases overall scrolling performance by around 40%.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aye. SIMD is free real estate in our CPUs - Let's use it.

@DHowett
Copy link
Member

DHowett commented Jun 9, 2023

Why does vectorization improve performance printing text so much? I thought that all rows were eagerly initialized before your recent work.

Is this literally all from reinitializing individual rows as we recycle/circle them?

@lhecker
Copy link
Member Author

lhecker commented Jun 9, 2023

Is this literally all from reinitializing individual rows as we recycle/circle them?

Yes, pretty much. It reduces the row initialization cost from around 80ns per 120 columns down to 5ns. OpenConsole with all these recent changes included processes something around 1.7M rows per second, so that's why it has such a big impact. 1.7M sounds like a lot, but it runs fairly close to spending an entire millisecond just initializing the text buffer on startup, whereas this new code won't even really show up in perf traces anymore. (And that other PR will make it a non-issue.)
Not clearing rows at all before writing them would also be nice, for instance by maintaining the "end of the row" column, and simply leaving the rest of the row as uninitialized memory. But I believe that doing this in a robust way is a long way out whereas this was fairly easy to implement, tune and test within about an hour.


// Fills _charsBuffer with whitespace and correspondingly _charOffsets
// with successive numbers from 0 to _columnCount+1.
#if defined(TIL_SSE_INTRINSICS)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw nobody ever sets this to true?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is part of #15498. I can pull that change into this branch and rebase it on main so we can merge it immediately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love that! I'm personally OK merging these out of order, even if it means that the numbers in the perf discussion part are incorrect.

} while (chars < charsEndLoop);
}

_mm256_storeu_si256(reinterpret_cast<__m256i*>(charsEndLoop), whitespace);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so wait, this will write up to 15 things off the end of the buffer? and there's no risk that this is going to stomp anything important?

Like, if the buffer is 17 columns wide... the char offsets buffer starts at alignment 16 from the end of the chars buffer, and the next ROW starts at alignment 16 from the end of the char offsets buffer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It writes up to 15 bytes off the end of the buffer, which at a granularity of wchar_t is up to 7 items. It won't stomp anything due to our alignment guarantees in the buffer, which ensures that all buffers start at a 16-byte aligned offset and end on one. If we ever determine that this alignment is unneeded for our performance goals, there's a few techniques we can use to avoid writing outside of the buffer, the most common being that you write the remaining N items in a simple for loop.

@lhecker lhecker changed the base branch from dev/lhecker/vt-perf4 to main June 13, 2023 23:16
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@lhecker lhecker marked this pull request as ready for review June 13, 2023 23:34
@lhecker
Copy link
Member Author

lhecker commented Jun 13, 2023

It was a bit of a messy rebase, but I think it's ready now.

offsets = _mm_add_epi16(offsets, increment);
chars += 8;
charOffsets += 8;
// If _columnCount is something like 120, the actual backing buffer for charOffsets is 121 items large.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see, i guess this is the part that scares me. every time we talk about the width of the backing buffers we're like, "YUP it's always +1" when in truth it is up to +16 or +32 or something.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could add an "at least" in there when I merge main in. (It's only up to +8 btw.)

@lhecker lhecker added the AutoMerge Marked for automatic merge by the bot when requirements are met label Jun 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Performance Performance-related issue AutoMerge Marked for automatic merge by the bot when requirements are met Product-Conhost For issues in the Console codebase
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants